Notes: Follow along to see what the ggplot syntax consists of.
library(ggplot2)
pf <- read.delim('pseudo_facebook.tsv')
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_point()
Response: The ages with the largest friend counts are mostly in the 20s and around 100.
Notes: The difference between qplot and ggplot is that ggplot lets you build more complex plots, but you have to specify the geom explicitly and wrap the variable mappings in aes().
qplot(x= age, y= friend_count, data= pf)
ggplot(aes(x= age, y=friend_count), data= pf) + geom_point()
summary(pf$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 20.00 28.00 37.28 50.00 113.00
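Notes: alpha helps with overplotting. With alpha = 1/20 each point is drawn at 5% opacity, so only regions where many points overlap look dark. geom_jitter adds random noise to each point, which is useful here because age is recorded in whole years (discrete), so the points otherwise stack in vertical columns.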
ggplot(aes(x= age, y=friend_count), data= pf) + geom_jitter(alpha = 1/20) + xlim(13,90)
## Warning: Removed 5172 rows containing missing values (geom_point).
Response: Fewer people have a high friend count.
Notes: Change geom_jitter to geom_point and add coord_trans(y = 'sqrt') at the end. Plain geom_jitter adds noise in both directions, so a friend count of 0 can be pushed below zero, and the square root of a negative number is undefined. Passing position = position_jitter(h = 0) to geom_point jitters the ages only and leaves friend_count unchanged, so no negative numbers appear.
?coord_trans
ggplot(aes(x= age, y=friend_count), data= pf) + geom_point(alpha = 1/20, position = position_jitter(h=0)) + xlim(13,90) +
coord_trans(y= 'sqrt')
## Warning: Removed 5190 rows containing missing values (geom_point).
A lot of people around age 70 have a friend count below 1000.
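The reason for position_jitter(h = 0) can be checked with a few lines of base R. This is a small sketch on made-up values (the `counts` vector and the noise width of 0.4 are assumptions for illustration, not taken from the data set): vertical noise pushes some zero counts below zero, and sqrt() of a negative number is NaN.

```r
set.seed(42)                       # make the random noise reproducible
counts <- rep(0, 1000)             # hypothetical users with a friend count of 0

# Vertical jitter: add uniform noise in [-0.4, 0.4] to every count
jittered <- counts + runif(length(counts), min = -0.4, max = 0.4)

# Some jittered counts are now negative
print(any(jittered < 0))

# sqrt() of a negative value is NaN (R also emits a warning, silenced here)
print(any(is.nan(suppressWarnings(sqrt(jittered)))))

# Without vertical noise (the h = 0 case) the counts stay at 0 and sqrt() is safe
print(any(is.nan(sqrt(counts))))
```

Jittering only along the age axis keeps every friend_count at its true, non-negative value, which is what the square-root transformation needs.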
Notes: Explore the relationship between friendships_initiated and age
names(pf)
## [1] "userid" "age"
## [3] "dob_day" "dob_year"
## [5] "dob_month" "gender"
## [7] "tenure" "friend_count"
## [9] "friendships_initiated" "likes"
## [11] "likes_received" "mobile_likes"
## [13] "mobile_likes_received" "www_likes"
## [15] "www_likes_received"
ggplot(aes(x= age, y=friendships_initiated), data= pf) + geom_point(alpha = 1/10, position = position_jitter(h=0)) + xlim(13,90) + coord_trans(y= 'sqrt')
## Warning: Removed 5191 rows containing missing values (geom_point).
Notes: A lot of people underestimate how many people see their Facebook posts. The graph shows perceived audience size vs. actual audience size (as percentages).
Notes: Important notice: in newer versions of dplyr (0.3.x+), the %.% operator has been deprecated and replaced with %>%.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
pf.fc_by_age <- pf %>%
  group_by(age) %>%
  summarise(friend_count_mean = mean(friend_count),
            friend_count_median = median(friend_count),
            n = n()) %>%
  arrange(age)
head(pf.fc_by_age, 20)
## # A tibble: 20 x 4
## age friend_count_mean friend_count_median n
## <int> <dbl> <dbl> <int>
## 1 13 165. 74.0 484
## 2 14 251. 132. 1925
## 3 15 348. 161. 2618
## 4 16 352. 172. 3086
## 5 17 350. 156. 3283
## 6 18 331. 162. 5196
## 7 19 334. 157. 4391
## 8 20 283. 135. 3769
## 9 21 236. 121. 3671
## 10 22 211. 106. 3032
## 11 23 203. 93.0 4404
## 12 24 186. 92.0 2827
## 13 25 131. 62.0 3641
## 14 26 144. 75.0 2815
## 15 27 134. 72.0 2240
## 16 28 126. 66.0 2364
## 17 29 121. 66.0 1936
## 18 30 115. 67.5 1716
## 19 31 118. 63.0 1694
## 20 32 114. 63.0 1443
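To make it clearer what the group_by / summarise / arrange chain computes, here is a rough base R equivalent on a tiny made-up data frame. The `toy` data and its values are hypothetical; only the column names mirror the ones used above.

```r
# Hypothetical miniature version of the pf data frame
toy <- data.frame(
  age = c(14, 13, 14, 13, 14),
  friend_count = c(5, 10, 50, 30, 200)
)

# tapply() splits friend_count by age and applies a function to each group;
# the groups come back in increasing age order, matching sort(unique(age))
by_age <- data.frame(
  age = sort(unique(toy$age)),
  friend_count_mean = as.vector(tapply(toy$friend_count, toy$age, mean)),
  friend_count_median = as.vector(tapply(toy$friend_count, toy$age, median)),
  n = as.vector(tapply(toy$friend_count, toy$age, length))
)

print(by_age)
```

Each row of `by_age` describes one age group, which is the same shape as pf.fc_by_age: one row per age with its mean, median, and group size.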
Create your plot!
ggplot(aes(x=age, y=friend_count_mean), data= pf.fc_by_age) + geom_line()
Notes: ggplot2 2.0.0 changed the syntax for passing arguments to the summary function when using stat = 'summary'. To set parameters on the function specified by fun.y, use the fun.args argument, e.g.: ggplot( … ) + geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .9), … ). (ggplot2 3.3.0 and later rename fun.y to fun.) To zoom in, use a coord_cartesian(xlim = c(13, 90)) layer rather than an xlim(13, 90) layer: xlim() drops the points outside the range before the summaries are computed, whereas coord_cartesian() only changes the visible window.
ggplot(aes(x = age, y = friend_count), data = pf) +
  coord_cartesian(xlim = c(13, 90)) +
  geom_point(alpha = 0.05,
             position = position_jitter(h = 0),
             color = 'orange') +
  # A plot keeps only one coordinate system: this coord_trans() replaces the
  # coord_cartesian() above, so the xlim zoom is not applied.
  coord_trans(y = 'sqrt') +
  geom_line(stat = 'summary', fun.y = mean) +
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .1), linetype = 2, color = 'blue') +
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .9), linetype = 2, color = 'blue') +
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .5), color = 'blue')
Response: The median (50th percentile) line sits slightly below the mean line. This is probably because a few users with very large friend counts pull the mean upward, while the median is largely unaffected by outliers.
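The gap between the mean and the median can be reproduced on a handful of numbers. The sketch below uses a made-up vector of friend counts (`counts` is hypothetical) with a single very large value.

```r
# Hypothetical friend counts for one age group, with one extreme value
counts <- c(0, 5, 20, 50, 80, 120, 4000)

# The mean is pulled upward by the single value of 4000
print(mean(counts))

# The median is the middle value and ignores how large the extreme is
print(median(counts))

# The same quantiles that the dashed and solid blue lines summarise per age
print(quantile(counts, probs = c(0.1, 0.5, 0.9)))
```

The 50% quantile is the median, so the solid blue line and a median line would coincide.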
See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.
Notes: People don't underestimate as badly when asked, "How many people do you think saw your posts in the whole month?"
cor.test(x=pf$age, y=pf$friend_count, method= 'pearson')
##
## Pearson's product-moment correlation
##
## data: pf$age and pf$friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03363072 -0.02118189
## sample estimates:
## cor
## -0.02740737
Look up the documentation for the cor.test function.
What’s the correlation between age and friend count? Round to three decimal places. Response: -0.027
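Rather than reading the coefficient off the printed output, it can be pulled out of the object that cor.test returns. A minimal sketch on made-up vectors (`x` and `y` are hypothetical, not the Facebook data):

```r
# Hypothetical ages and friend counts
x <- c(13, 18, 25, 40, 60, 75)
y <- c(150, 330, 130, 90, 80, 100)

result <- cor.test(x, y, method = "pearson")

# The estimate is a named numeric; unname() drops the "cor" label
r <- unname(result$estimate)
print(round(r, 3))

# The confidence interval and p-value are stored in the same object
print(result$conf.int)
print(result$p.value)
```

If only the coefficient is needed, cor(x, y) returns the same number without the test statistics.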
with(subset(pf, age<=70), cor.test(age, friend_count))
##
## Pearson's product-moment correlation
##
## data: age and friend_count
## t = -52.592, df = 91029, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1780220 -0.1654129
## sample estimates:
## cor
## -0.1717245
Notes: For a comparison of the Pearson, Kendall, and Spearman correlation methods, see https://www.statisticssolutions.com/correlation-pearson-kendall-spearman/
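As a quick illustration of how the three methods differ, the sketch below compares them on a made-up relationship that is perfectly monotonic but not linear (`x` and `y` are hypothetical).

```r
x <- 1:10
y <- x^3          # always increasing, but far from a straight line

# Pearson measures linear association, so it falls short of 1 here
pearson <- cor(x, y, method = "pearson")

# Spearman and Kendall work on ranks, so any strictly increasing
# relationship scores 1
spearman <- cor(x, y, method = "spearman")
kendall <- cor(x, y, method = "kendall")

print(c(pearson = pearson, spearman = spearman, kendall = kendall))
```

The same method argument is accepted by cor.test, so cor.test(age, friend_count, method = 'spearman') would give the rank-based version of the test above.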
ggplot(aes(x=www_likes_received, y= likes_received), data = pf) + geom_point() +
scale_x_continuous(limits = c(0,25000)) + scale_y_continuous(limits = c(0, 30000))
## Warning: Removed 12 rows containing missing values (geom_point).
Notes: ‘lm’ stands for linear model
ggplot(aes(x=www_likes_received, y= likes_received), data = pf) + geom_point() +
xlim(0, quantile(pf$www_likes_received, 0.95)) +
ylim(0, quantile(pf$likes_received, 0.95)) +
geom_smooth(method = 'lm', color= 'red')
## Warning: Removed 6075 rows containing non-finite values (stat_smooth).
## Warning: Removed 6075 rows containing missing values (geom_point).
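The red line drawn by geom_smooth(method = 'lm') is a least-squares fit, which can also be computed directly with lm(). The sketch below uses made-up values (`x` and `y` are hypothetical) and shows how the fitted slope relates to the Pearson correlation.

```r
# Hypothetical, roughly linear data
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)

# Fit the linear model y = intercept + slope * x
fit <- lm(y ~ x)
print(coef(fit))

slope <- unname(coef(fit)["x"])
r <- cor(x, y)

# For simple linear regression, r equals the slope rescaled by the
# standard deviations of the two variables
print(all.equal(r, slope * sd(x) / sd(y)))
```

So a steep line does not by itself mean a strong correlation; the correlation also depends on how spread out each variable is.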
What’s the correlation between the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.
cor.test(x=pf$www_likes_received, y=pf$likes_received, method= 'pearson')
##
## Pearson's product-moment correlation
##
## data: pf$www_likes_received and pf$likes_received
## t = 937.1, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9473553 0.9486176
## sample estimates:
## cor
## 0.9479902
Response: 0.948
library(alr3)
## Loading required package: car
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
summary(Mitchell)
## Month Temp
## Min. : 0.00 Min. :-7.4778
## 1st Qu.: 50.75 1st Qu.:-0.3486
## Median :101.50 Median :10.4500
## Mean :101.50 Mean :10.3125
## 3rd Qu.:152.25 3rd Qu.:20.4306
## Max. :203.00 Max. :27.6056
Create your plot!
ggplot(aes(x=Month, y=Temp), data= Mitchell) + geom_point()
Take a guess for the correlation coefficient for the scatterplot. 0
What is the actual correlation of the two variables? (Round to the thousandths place) 0.057
cor.test(x=Mitchell$Month, y=Mitchell$Temp)
##
## Pearson's product-moment correlation
##
## data: Mitchell$Month and Mitchell$Temp
## t = 0.81816, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.08053637 0.19331562
## sample estimates:
## cor
## 0.05747063
Notes: Break up the x-axis so that every 12 months corresponds to a year. What layer would you add to your existing code to do this?
ggplot(aes(x=Month, y=Temp), data= Mitchell) + geom_point() + scale_x_continuous(breaks = seq(0, 203, 12))
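What do you notice? Response: The temperatures follow a cyclical pattern, similar to a sine wave.
Watch the solution video and check out the Instructor Notes! Notes: Plotting Month %% 12 on the x-axis overlays the years so the seasonal cycle is easy to see:
ggplot(aes(x = (Month %% 12), y = Temp), data = Mitchell) + geom_point()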
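A correlation near zero does not mean there is no relationship. The sketch below builds a made-up seasonal series (the sine curve and its constants are assumptions, not the Mitchell data) and shows that the linear correlation with the month index is tiny even though the pattern is perfectly regular.

```r
month <- 0:203                                  # same index range as Mitchell$Month
temp <- 10 + 15 * sin(2 * pi * month / 12)      # hypothetical 12-month cycle

# Linear correlation between time and temperature is close to 0
r_linear <- cor(month, temp)
print(round(r_linear, 3))

# Folding the months with %% 12 lines the years up on top of each other
month_of_year <- month %% 12
seasonal_mean <- tapply(temp, month_of_year, mean)

# The folded means swing widely, which exposes the cycle
print(range(seasonal_mean))
```

Pearson's r only measures linear association, so a strong cyclical pattern can hide behind a coefficient close to zero.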
Notes: age_with_months = age + months in decimal form
pf$age_with_months <- pf$age + (12- pf$dob_month) / 12
age_groups_with_months <- group_by(pf, age_with_months)
pf.fc_by_age_months <- summarise(age_groups_with_months,
                                 friend_count_mean = mean(friend_count),
                                 friend_count_median = median(friend_count),
                                 n = n())
pf.fc_by_age_months <- arrange(pf.fc_by_age_months, age_with_months)
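To see what the age_with_months formula produces, here is a small worked example. The helper function name and the sample ages are hypothetical; the arithmetic is the same as in the assignment above.

```r
# Same arithmetic as pf$age + (12 - pf$dob_month) / 12, wrapped in a helper
age_in_years_and_months <- function(age, dob_month) {
  age + (12 - dob_month) / 12
}

# Born in March (month 3): 9 of 12 months are added
print(age_in_years_and_months(36, 3))    # 36.75

# Born in December (month 12): nothing is added
print(age_in_years_and_months(36, 12))   # 36

# Born in January (month 1): 11 of 12 months are added
print(age_in_years_and_months(36, 1))
```

Under this formula, users with the same age in years are spread over twelve finer-grained values, which is why the monthly plot has many more points than the yearly one.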
Programming Assignment
ggplot(aes(x=age_with_months, y= friend_count_mean), data = subset(pf.fc_by_age_months, age_with_months < 71)) + geom_line()
Notes: Increasing the bin width averages over more users per point, which gives a smoother line but hides detail that narrower bins would show. geom_smooth() offers another option: it adds a smoothed line on top of the existing plot.
p1 <- ggplot(aes(x=age, y= friend_count_mean), data = subset(pf.fc_by_age, age < 71)) + geom_line() + geom_smooth()
p2 <- ggplot(aes(x=age_with_months, y= friend_count_mean), data = subset(pf.fc_by_age_months, age_with_months < 71)) + geom_line() + geom_smooth()
p3 <- ggplot(aes(x= round(age / 5) * 5, y= friend_count), data = subset(pf, age < 71)) + geom_line(stat = 'summary', fun.y=mean)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
grid.arrange(p2,p1,p3, ncol= 1)
## `geom_smooth()` using method = 'loess'
## `geom_smooth()` using method = 'loess'
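The 5-year binning used for p3 can be tried on a few numbers. The sketch below uses made-up ages and friend counts (both vectors are hypothetical) to show how round(age / 5) * 5 groups ages and how the binned means are formed.

```r
# Hypothetical ages and friend counts
age <- c(13, 14, 16, 17, 18, 21, 22, 23)
friend_count <- c(100, 300, 350, 330, 320, 240, 210, 200)

# Snap each age to the nearest multiple of 5
age_bin <- round(age / 5) * 5
print(age_bin)

# One mean per bin: eight ages collapse into three points
binned_mean <- tapply(friend_count, age_bin, mean)
print(binned_mean)
```

Fewer, wider bins mean each plotted value averages over more users, which is why p3 looks smoother than p1 and p2.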
Notes: There is no single best graph. Each one reveals a different aspect of the same data.
Reflection: I learned a lot of new functions and concepts, such as gridExtra, ggplot layers, and correlation coefficients.
Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes! install.packages('knitr', dependencies = TRUE)